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Chapter  I 
IMTRODUCTIOM 


This  note  describes  the  method  used,  and  the  results 
obtained  for  simulating  an  existing  weather  forcasting  program  of 
Rivas[6]  with  the  W  ASHCLOTH  [  1  , 2  ]  siraulator.  The  V/ASHCLOTH 
simulator  is  used  to  investigate  how  FORTRAN  programs  would 
behave  on  an  idealized  MIMD  machine,  the  paracompu t er [ 1 2  ]  ,  which 
will  be  approximated  by  the  NYU  Ul t racoapu t er [ 5 ] . 

The  meiao  sheds  light  on  the  problems  encountered  in  taking 
existing  codes  and  running  them  on  the  Ultracomputer. 


This  work  was  supported  in  part  by  NASA  under  grant  NSG-503^,  by 
NSF  under  grant  number  NSF-MCS7 9-07 80 4 ,  and  by  DOE  under  grant 
DE-AC02-76En03077 . 


-  1  - 


CIIAPTEH  2 
PROBLEII  DESCniPTIOn 


An  earlier  paper[11]  reported  on  tue  sii.iulation  of  a 
ci/o-dii:ien3ional  at.jospneric  iaodel  code  for  v.'eather  I'orcastini,;. 
The  success  of  tnat  project  was  the  iupetus  for  takiHi^  a  coupiete 
three-diiaensional  atuospneric  rjodel  code  used  for  weather 
forcasting,  and  siuulating  it  under  V/ASHCLOTH.  Some  details  of 
this  iaodel  are  reproduced  in  the  Appendix  from  reference  [6]. 

The  code  we   received   from   MASA/Goddard ,   was  written   in 

FORTRAiJ   for  an  AIIDAHL  coijputer.   The  code  consisted  of  more  than 

7000  lines  and  used  raost  of  the  16 !1 -bytes  of   mewory  present   in 
the  A:1DA11L. 


The  code  generates  solutions  to  a  global  circulation  model 
of  the  atiiiosphere  by  a  finite  difference  scheiae  wuich  couputes 
values  of  several  interrelated  state  variables  on  a 
three-dimensional  grid  encircling  the  globe.  The  equations  of 
notion  are  listed  in  the  Appendix.  A  region  of  fixed  altitude  is 
called  a  layer.  The  model  uses  a  fairly  small  set  of  layers 
which  is  usually  a  fixed  property  of  the  model.  The  code  we 
received   was   a   nine-layer   model.    The   grid  on  each  layer  is 
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uniformly  distributed  in  latitude  and  longitude. 

After  a  [^iven  nuuber  of  iterations,  the  solution  is  used  to 
recompute  certain  physical  quantities  at  each  grid  point.  These 
computations  are  alj^ebraic  in  nature  and  can  clearly  be  performed 
at  each  grid  point  in  parallel.  It  was  decided  to  remove  this 
part  of  the  code  fror.i  our  study,  since  v;e  felt  that  paralleliziHij; 
it  would  be  easy  and  would  not  yield  any  new  information. 
Furthermore,  the  omitted  code  would  De  very  efficient  so  that  the 
estimates  of  efficiency  we  report  are  conservative. 

Given  initial  data  for  the  state  variables,  their  values  at 
a  future  time  are  determined  by  this  procraa.  The  difference 
equations  are  used  to  march  ahead  in  time.  For  stability 
reasons,  the  solution  method  being  explicit,  the  size  of  a  time 
step  is  limited  by  the  spatial  distance  between  adjacent  points 
on  the  coordinate  grid  system.  Thus,  to  compute  the  weather  at  a 
specific  time,  more  iterations  have  to  be  performed  when  using 
finer  grids  and  hence,  for  a  fixed  number  of  layers,  refining  the 
grid  by  a  factor  of  two  increases  the  serial  computing  time  by  a 
factor  of  eight. 


Special  problems  occur  near  the  poles  since  the  grid  is 
evenly  spaced  in  latitude  and  longitude  and  hence  the  spacial 
distance  between  adjacent  polar  grid  points  is  small  and 
instabilities  would  result  if  the  time  step  were  not 
correspondingly  small.  The  larger  time  step  actually  used, 
results  in  amplification  of  the  high  frequency  harmonics  of  the 
discretization  error.   Smoothing  of  the  solution  near   the   polar 
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region,    which    decreases    the   aaplitude   of   high   frequency 
haruonics,  allows  for  larj^er  time  steps  without  instabilities. 

The  NASA  code  offers  a  choice  of  two  smoothing  routines  for 
use  in  the  polar  regions.  One  method  uses  Fourier  filtering  to 
daapen  the  higher  frequency  harmonics.  The  other  method  uses 
weighted  averages  of  neighboring  points.  In  both  of  these 
uetiiods,  the  suoothing  is  only  perforiiied  on  lines  of  latitude 
near  the  poles.  The  degree  of  sraoothing  increases  mono  tonical  ly 
towards  the  polar  regions. 
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CHAPTER    3 
GETTiriG    THE    SERIAL    CODE    TO    RUN 


Since  WASHCLOTH  runs  on  the  CDC-6600,  and  not  on  the  AMDAHL, 
it  was  first  necessary  to  ij,et  the  serial  code  running  on  the 
CDC-6600.  This  proved  to  be  the  most  difficult  part  of  the  whole 
simulation  process.  The  CDC-6600  has  a  very  snail  nemory  by 
present  standards,  128K  words  or  about  IM-byte.  Obviously,  it 
was  necessary  to  run  the  program  with  much  smaller  grid  sizes. 
Changing  grid  sizes  proved  to  be  non-trivial  because  many  of  the 
arrays  were  equivalenced  to  save  space.  Cutting  the  array 
dimensions  made  some  of  these  equivalences  impossible.  Also,  the 
initial  data  had  to  be  generated  for  points  on  the  reduced  grid. 

In  addition  to  the  severe  space  problem,  there  were 
incompatibilities  in  the  versions  of  FORTRAN  on  the  two  machines. 

The  FTN  compiler  on  the  CDC-6600  allows  only  one,  two,   or  three 

dimensi'onal    arrays.     The    program   we   converted   used  four 

dimensional  arrays.   The  fourth  dimension  was  used  to  group  four 

variables   rather   than  having  'a  separate  three-dimensional  array 
for  each  of  them. 
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The  initial  data  had  do  be  interpolated  to  get  values  at  our 
grid  points.  When  the  proi^rava  was  running  on  the  CDC-6600,  there 
was  no  way  to  verify  that  it  had  been  done  correctly.  The  i^rid 
was  too  crude  to  generate  realistic  solutions.  Once  v/e  were  able 
to  verify  that  all  portions  of  the  code  were  exercised  and  did 
not  generate  obviously  erroneous  results  we  certified  this  code 
as  the  CDC-6600  serial  code  that  we  were  to  parallelize  and 
simulate.  The  results  of  the  parallel  code  had  to  agree  with 
this  benchmark  serial  code. 
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CHAPTER  1 
PARALLELIZIIIG  THE  CODE 


Our  plan  v;as  to  use  the  D0\JAIT[9]  construct  to  parallelize 
the  code  whenever  possible.  This  construct  allows  one  and  two 
dimensional  DO  loops  for  which  the  calculations  for  each  of  the 
values  of  the  loop  indices  are  independent  to  be  performed 
siaul taneously .  Ssynchronization  occurs  at  the  end  of  each  loop. 
The  code  outside  DOUAIT  regions  is  executed  by  one  processin-^ 
element,  PE,  whereas  inside  the  DOWAIT,  as  many  processing 
elements  as  are  available  are  used  as  assistants. 

The  DOVJAIT  code  can  be  run  with  or  without  M0P[3].  VJhen 
used  without  MOP,  each  PE  except  the  one  running  the  serial  code, 
busy  waits  outside  of  DOWAIT  loops.  The  DOWAIT  code  opens  a 
"gate"  for  these  PE's  and  within  a  few  instructions  they  perform 
independent  loop  iterations. 

When  using  MOP,  one  PE  executes  the  serial  code  and  the 
other  PE's  are  available  for  other  jobs  rather  than  busy  waiiting. 
Upon  entering  the  DOWAIT,  the  processor  makes  an  operating  system 
request  for  helpers.  As  the  loop  iterations  complete,  each  of 
these  helpers  complete  and  is   released   for   scheduling   by   the 
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operatlns   system.    The  number  of  cycles  to  initiate  a  helper  is 

substantially  higher  when  using   MOP,   but   no   busy  waiting   is 

needed    while    executing    serial    code    or    for  perforaiing 
synchronization. 

Several  changes  were  required  to  convert  the  serial  code  to 
a  form  where  it  consisted  of  segments  suitable  for  the  DO  17 AIT 
construct.  The  code  consistently  used  index  names  starting  with 
I  for  longitude,  J  for  latitude,  and  L  for  altitude,  and  we  will 
use  this  convention  freely  as  well. 


Parts  of  code  required  the  solution  at  one  value  of  L  to 
generate  the  value  for  L+1,  primarily  because  of  quadrature 
formulas.  The  values  for  I  and  J,  at  a  fixed  L,  could  be 
determined  from  values  of  state  variables  from  the  previous 
iteration,  except  for  the  smoothings.  If  the  code  were  organized 
as  a  sequence  of  loops  with  the  altitude  calculations  in  the 
inner-most  loop,  then  each  of  the  two  outer  loops  for  I  and  J 
could  be  replaced  by  a  sequence  of  DOV/AIT  loops  ranging  over  both 
latitude  and  longitude.  Optimizations  performed  by  serial 
compilers  could  still  be  applied  to  the  inner  L  loop.  Note  that 
this  organization,  placing  serial  loops  inside  parallel  loops,  is 
the  reverse  of  that  which  is  suitable  for  vector  processing. 

The  original  code  does  not  have  this  consistent 
organization.  The  organization  varied  from  place  to  place.  This 
was  primarily  because  certain  loop  orderings  require  more 
temporary  space  than  others.  The  loops  were  rearranged  as 
described   above   whenever   possible.    The   resulting   code   was 
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checked  against  the  serial  version. 

A  technique  frequently  referred  to  as  "banding"  v/as  used  in 
the  original  code  to  save  space  for  temporary  variables. 
Intermediate  results  were  computed  and  saved  in  a  band  which 
changes  as  the  iteration  proceeds.  In  this  code,  banding  was 
done  on  J  so  that  several  temporaries  variables  were  saved  for  J, 
J-1,  and  J-2  only.  The  solution  is  computed  sequentially  from 
J=1  to  J=JMAX  and  each  time  J  is  incremented  the  J-2  part  of  the 
temporary  matrices  are  replaced  by  J-1,  the  J-1  elements  replaced 
by  J,  and  the  new  J  elements  are  computed. 

The  banding  technique  saves  memory  but  introduces 
non-essential  serialization  which  requires  that  latitudes  be  done 
sequentially.  In  the.  parallel  version  of  the  code,  banding  was 
replaced  by  two  DOUAIT  loops.  The  first  loop  computed  and  stored 
all  the  temporary  values  at  all  the  grid  points.  The  second 
performed  the  updates. 


The  smoothing  portions  were  also  parallelized.  In  this 
case,  the  smoothing  for  each  latitude  and  each  layer  were  done 
simultaneously.  The  amount  of  computation  varies  with  latitude 
but  this  is  not  a  problem  since  no  assumption  is  made  that  each 
sub-task  be  of  the  same  size.  Since  the  number  of  layers  is 
small,  the  degree  of  parallelism  is  less  here  than  other  parts  of 
the  program.  For  Fourier  filtering,  several  PE ' s  could  have  been 
assigned  to  each  of  these  sub-tasks [ 1 0  ]  .  This  was  not  done 
because  the  filtering  represented  a  relatively  small  portion  of 
the  computing  time  and  limitations  of  the  WASHCLOTH  emulator  made 
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nested  DOUAIT's  rather  difficult 
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CHAPTER  5 
RESULTS 


Because  of  the  small  meraory  size  of  the  CDC-6600,  it  was  not 
possible  to  simulate  realistic  ^rid  sizes.  The  srid  size  was  set 
to  24x16x3  and  simulation  runs  were  aade  for  varying  number  of 
processing  elements.  Table  1  shows  the  timing  results  for 
performing  one  iteration  of  the  code  without  HOP.  The  first 
entry  shows  the  number  of  instructions  for  the  serial  version  of 
the  code. 

Each  iteration  consisted  of  55  DOWAIT  loops.  At  the  grid 
that  was  used,  this  took  12,208  loop  instances.  The  increase  in 
of  about  700,000  instructions  for  the  parallel  version  is  almost 
exclusively  due  to  the  extra  instructions  involved  in  selecting  a 
loop  iteration.  In  this  case  there  are  about  50  instructions 
required  to  initiate  a  loop  iteration.  Part  of  the  reason  that 
it  took  50  instructions  to  initiate  an  iteration  is  that  a 
general  purpose  subroutine  was  called  rather  than  having  a 
compiler  that  could  generate  inline  code.  On  the  other  hand,  had 
we  run  under  MOP,  this  number  might  have  been  substantially 
higher.  Since  the  time  to  initiate  a  DOWAIT/DOALL  iteration 
depends   on   the   implementation,   it   is   helpful  to  see  how  the 
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results  depend  on  the  number  of  instructions  needed  to  initiate 
an  iteration.  Table  2  shows  how  the  number  of  instructions  for 
loop  initialization  effects  efficiency.  Since  the  number  of 
instructions  per  loop  iteration  increases  as  we  increase  the 
number  of  layers,  the  efficiency  would  increase  with  more  layers. 
Since  we  simulated  only  a  three  layer  atmospheric  model,  our 
results  are  conservative. 

The  version  of  VJASHCLOTH  used  for  the  simulations  allowed 
only  32  PE's  to  be  simulated.  The  methodology  needed  to 
extrapolate  these  results  to  larger  grids  and  more  processini^ 
elements  was  presented  in  reference  [8].  Since  the  computation 
is  dominated  by  two  dimensional  DOWAIT  loops,  doubling  the  number 
of  latitude  and  longitude  grid  points  should  allow  four  times  as 
□any  processing  elements  to  be  used  with  the  same  efficiencies  as 
in  Table  1.  Increasing  the  number  of  layers  would  increase  the 
number  of  instructions  in  many  of  the  DO V/ AIT  loops  and  therefore 
should  also  improve  efficiency  substantially.  By  scaling  up  to 
realistic  grid  sizes,  these  results  can  be  used  to  predict  the 
behavior  of  a  system  of  4096  processing  elements.  Runs  commonly 
made  today,  use  grid  sizes  with  12  times  as  many  points  around 
the  globe  and  three  times  the  number  of  layers,  so  that  about  500 
processing  elements  could  in  that  case  be  used  higher  efficiency 
than  reported  here  for  32  processing  elements.  I7ith  finer  grid 
sizes  all  4096  processing  elements  could  be  used  to  achieve 
speedups  of  greater  than  3000  over  the  speed  of  sequential 
execution. 
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THREE-DIMENSIONAL  WEATHER  CODE 
24   X   16  X  3 


)?PE 

T 

Speedup 

Eff 

iciency 

1* 

6105K 

1 

6837K 

1  .00 

1  .00 

2 

3431K 

1.99 

1  .00 

it 

1749K 

3.91 

.98 

S 

904K 

17.56 

.95 

16 

493K 

113.88 

.07 

32 

289K 

123.64 

.74 

*  Serial 

version  of 

the  code 

Table  1 

THREE-DIMENSIONAL  WEATHER  CODE 
24   X   16  X  3 


NUMBER  OF  INITIALIZATION  INSTRUCTIONS 
10        50     100         500     1000 


EFFICIENCY 


98 


.91 


Table  2 
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CHAPTER  6 
COMCLUSIOWS 


It  is  clear  frora  these  results  that  three-dimensional 
weather  calculations  are  well  suited  for  the  Paracomputer  and 
presumably  the  NYU  Ul tracomput er . 

The  amount  of  surgery  on  the  code  that  was  needed  proved  to 
be  higher  than  was  initially  expected,  but  was  for  the  most  part 
associated  with  porting  to  the  CDC-6600,  rather  that  with 
parallel ization  per  se. 

The  simulations  point  up  the  fact  that  to  attain  high 
efficiency,  the  number  of  instructions  needed  to  bring  in 
"helper"  processing  elements  should  be  kept  small  since  the 
number  of  synchronization  points  per  iteration  is  fairly  high. 
Thus,  in  a  real  system,  one  mode  of  process  creation  should  take 
no  more  that  about  100  instructions,  not  thousands. 
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APPENDIX 
Pri-itive    Equations    of    Motion  ULTRACOMPUTER   NOTE   #55 

i.&2.   Horizontal  momentum  equations 

^^        CZ  a  t  O  O 

=     --V0   -    -0—         VtT-   (f4U^^'^)      [kXTTV     +     7T(F 

XT  ^  - 

3.         Continuity    equation 

(3.1)       l^   +    V-{^)     +    l—{r>h)     =0  or 

I  1 

(3-2)       V^  "^    ~  S'^'  (^V)da    =    -V-    C^^Vda 


4 .  Equation    of    state 

RT 
F 

5.  First    law    of    thermodynamics 

--—    +    7-  (ti\VT)      +    J— (TTOT)     =    — :r-    +    pr^  ('^=H?) 

d  t  3  o  C  C  at 

From    3    =T/p    ,    p=p   +aTT,    Qj^cTT+ira,    TT=j-^V-V7r,      k=R/C  we   get 

^^  =   p      -^^  +   TT—  Replacing   in    5, 

XT 

LIT  ,   vCAVT)^?^  ^=  IL£|I(|Z^,v.v.)    +  ^ 

st  '^da  pdt  L 

6.  Humidity   equation 

^  ^    V-  (-r^q)    =    0 
at. 

7.  Hydrostatic   equation 

^^^ — ^       ~        ^     ^ 
a      K  D 
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Of  the  variables  tt  ,  u  ,  v,  T,  q  ,  (}) ,  a  ,  a  we  update  the  5  primary  variables"^ 

■;r,u,v,T,  and  g  using  equations  1,2,3.2,5  and  6.   Fron  equations 

3.1,  4,  and  7  we  can  obtain  4),  a,  and  c  which  are  our  secondary 

variables.   Note  that  p=a7r+p_^ 

^  ^top 

Sea  level  pressure 

"  J    4-  4- •       -\  3p  1     P.I    /SLP.      (dC 

.^.ydrostatic  eg.  ^  ^-^  =  -p  =  -_  =  -|^  •  log  (-^)  =    -)   ^ 

(p  *s 

:.  SLP  =  p(G=l)  exp(-|) 

RT 


< 
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